Hashing¶


Remember Binary Search?¶


Binary Search Analysis¶

$$ T(n) = \begin{cases} c & \text{if } n = 1 \\ c + T(n/2) & \text{if } n > 1 \end{cases}$$

$T(n) = c + T(n/2) \implies T(n) = c + [c + T(n/4)] \implies T(n) = 2c + T(n/4)$

$T(n) = 2c + [c + T(n/8)] \implies T(n) = 3c + T(n/8)$

$T(n) = 3c + T(n/2^3)$

$T(n) = 3c + [c + T(n/2^4)] \implies T(n) = 4c + T(n/2^4)$

When will we reach T(1) ?¶
Suppose $2^k = n,$¶

$T(n) = kc + T(n/2^k) = kc + T(1) = kc + c$

$T(n) = (k+1)c$

That is $k = \log_2(n)$¶

$T(n) = O(\log n)$
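The recurrence above describes the halving search directly. A minimal sketch that counts comparisons (the `binary_search` helper and its probe counter are illustrative, not part of the slides):

```python
def binary_search(A, key):
    """Return (index, comparisons) for key in sorted list A, or (-1, comparisons)."""
    lo, hi, comparisons = 0, len(A) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if A[mid] == key:
            return mid, comparisons
        elif A[mid] < key:
            lo = mid + 1        # key is in the upper half
        else:
            hi = mid - 1        # key is in the lower half
    return -1, comparisons

# Searching 1024 sorted elements needs on the order of log2(1024) = 10 probes.
A = list(range(1024))
idx, probes = binary_search(A, 1023)
```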

Searching¶

Can we improve the search operation to achieve better than $O(\log n)$ time?¶

Searching based on comparisons¶

  • $N$ elements have to be searched.

  • That requires how many pairwise comparisons?

  • Depth of the decision structure is...?

    • Level 0: 1
    • Level 1: 2
    • Level 2: 4
    • Level 3: 8

Searching based on comparisons¶

  • At level $k$, I have $N$ possibilities

    $\implies 2^k = N$

    $\implies k = \log_2(N)$

  • $k \approx \log N$

Remember Arrays?¶

Given a location $i$, we can access $A[i]$ in $O(1)$ time.¶

Can we do the reverse?!¶

  • Given $A[i]$, can we figure out $i$?¶
  • Given an element, can we decide the index it should be placed at?¶

Suppose we have a set of products with product IDs: $[1100,\ldots,1199] $

We have an array $A[0] - A[99]$.

  • $A[0] : 1100$
  • $A[1] : 1101$
  • $A[2] : 1102$
  • ...
  • $A[99]: 1199$
How did we get here?¶

  • 1101%1100 = 1

  • 1102%1100 = 2

  • 1127%1100 = 27
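The mapping above is direct addressing: the key itself determines the index. A tiny sketch (the `slot` helper is illustrative):

```python
# Product IDs 1100..1199 map directly to slots 0..99.
A = [None] * 100

def slot(product_id):
    # For this ID range, product_id % 1100 gives the same result.
    return product_id - 1100

for pid in range(1100, 1200):
    A[slot(pid)] = pid
```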

Hashing¶

Hashing is the process of mapping a search key to a limited range of array indices with the goal of providing direct access to the keys.¶

Assumptions

  • Elements are called keys
  • A key can take any integer value
  • We cannot create an array large enough to store all possible integer values.

Hashing¶

  • The keys are stored in an array called a hash table
  • A hash function maps the search keys to specific entries in the table.

Hashing¶

  • Keys: 765, 431, 96, 142, 579, 226, 903, 388

  • Hash Table T contains M = 13 elements

  • Hash Function $h$ (key) = key % $M$

In [1]:
keys=[765, 431, 96, 142, 579, 226, 903, 388]
M = 13

for key in keys:
    print(key,"-->",key%M)
765 --> 11
431 --> 2
96 --> 5
142 --> 12
579 --> 7
226 --> 5
903 --> 6
388 --> 11
  • 96 --> 5

  • 226 --> 5

  • Entry [5] already contains 96

  • This is called a Collision

Collision: Linear Probing¶


Linear Probing¶

  • We must resolve the collision by...
  • probing the table to find another available slot,
  • using a linear probe, which
    • examines the table entries in sequential order,
    • starting with the entry immediately following the original hash location.


Linear Probing¶


keys = [765, 431, 96, 142, 579, 226, 903, 388]¶

903 % 13 = 6¶

226 is already at 6¶


  • While probing, if the end of the array is reached, wrap around.

Searching¶

  • Same as insert.

    • Find the hash value
    • Start a linear search from there on, until we encounter the key or a null (empty) value.
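The insert and search procedures can be sketched with a plain Python list, where None marks an empty slot (the table and helper names are illustrative):

```python
M = 13
table = [None] * M

def insert(key):
    slot = key % M
    while table[slot] is not None:   # collision: probe sequentially
        slot = (slot + 1) % M        # wrap around at the end of the array
    table[slot] = key

def search(key):
    slot = key % M
    while table[slot] is not None:   # an empty slot ends the probe
        if table[slot] == key:
            return slot
        slot = (slot + 1) % M
    return None                      # key is not in the table

for k in [765, 431, 96, 142, 579, 226, 903, 388]:
    insert(k)
```

Note that `insert` would loop forever on a full table; a real implementation tracks the load factor and rehashes first.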

Deletions¶

  • Search the key to be deleted.

  • Simply remove it by setting the corresponding table entry to None.

  • This does not work. Why?

Insert 903¶

903 hashes to slot 6, which already holds 226; the linear probe passes 579 at slot 7 and places 903 at slot 8.

Delete 226¶

Slot 6 is simply set back to None.

Search 903¶

The search for 903 starts at slot 6, finds it empty, and stops, wrongly concluding that 903 is not in the table.

Have an indicator that this entry was deleted.¶
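One way to sketch that indicator is a tombstone sentinel that searches probe past but inserts may reuse (the `DELETED` marker and helper names are illustrative):

```python
M = 13
EMPTY, DELETED = None, "<deleted>"   # tombstone sentinel (illustrative)
table = [EMPTY] * M

def insert(key):
    slot = key % M
    while table[slot] not in (EMPTY, DELETED):   # tombstoned slots are reusable
        slot = (slot + 1) % M
    table[slot] = key

def search(key):
    slot = key % M
    while table[slot] is not EMPTY:  # keep probing past tombstones
        if table[slot] == key:
            return slot
        slot = (slot + 1) % M
    return None

def delete(key):
    slot = search(key)
    if slot is not None:
        table[slot] = DELETED        # mark as deleted, do not empty

for k in [765, 431, 96, 142, 579, 226, 903, 388]:
    insert(k)
delete(226)
```

With the tombstone in slot 6, the probe for 903 keeps going and still finds it at slot 8.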

Clustering¶

  • As more keys are added to the hash table, more collisions are likely to occur.

  • Since each collision requires a linear probe to find the next available slot, the keys begin to form clusters.

  • As the clusters grow larger, so too does the probability that the next key added to the table will result in a collision.


Clustering¶

  • Each empty slot has a probability of 1/13 $(1/M)$.


  • What is the probability the next key will occupy the empty slot at position 4?

    • 1/13: only a key that hashes directly to slot 4 lands there.
  • What is the probability the next key will occupy the empty slot at position 9?

    • 5/13: slots 5 - 8 are occupied, so a key hashing to any of 5, 6, 7, 8, or 9 probes into slot 9.

Clustering¶

  • This type of clustering is known as primary clustering since it occurs near the original hash position.

  • As the clusters grow larger, so too does the length of the search needed to find the next available slot.

Clustering: Modified Linear Probe¶

  • When probing to find the next available slot, a loop is used to iterate through the table entries.

  • The order in which the entries are visited forms a probe sequence.

    • slot = ( home + $i * c$ ) % $M$, where $i=0,1,2,…$
      • where $i$ is the $i^{th}$ probe in the sequence
      • home is the original position $(i = 0)$:
        home = (key % $M$)

Clustering: Modified Linear Probe¶

765, 431, 96, 142, 579, 226, 903, 388¶


5 collisions¶

Clustering: Modified Linear Probe¶

  • We can improve linear probe by skipping over multiple locations.

    • slot = ( home + $i * c$ ) % $M$, where $i=0,1,2,…$
    • $c$ is a constant
  • If $c = 3$:

2 collisions only.¶
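The two collision counts above can be reproduced by building the table with step $c = 1$ versus $c = 3$ (the `build` helper is illustrative):

```python
def build(keys, M, c):
    """Insert keys with probe sequence (home + i*c) % M; return collision count."""
    table = [None] * M
    collisions = 0
    for key in keys:
        home, i = key % M, 0
        while table[(home + i * c) % M] is not None:
            i += 1
            collisions += 1
        table[(home + i * c) % M] = key
    return collisions

keys = [765, 431, 96, 142, 579, 226, 903, 388]
```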

How do we choose c?¶

Clustering: Modified Linear Probe¶

  • $c$ and $M$ must be coprime

    • GCD$(c,M) = 1 $ (Greatest Common Divisor / Highest Common Factor)
    • Why?
  • If $M$ is prime, then $c$ can take any value.

  • If $M=13$ and $c=2$, and a key hashes to 2, the visit order is:

    • 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11, 0
  • If $M=13$ and $c=3$, the visit order is:

    • 5, 8, 11, 1, 4, 7, 10, 0, 3, 6, 9, 12
  • If $M=10$ and $c=2$, the visit order is:

    • 4, 6, 8, 0, 2, 4, 6, 8, 0
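The visit orders above can be checked directly; when GCD$(c, M) > 1$ the probe cycles over a fraction of the table (the `visit_order` helper is illustrative):

```python
def visit_order(home, c, M):
    """Slots visited by (home + i*c) % M for i = 1 .. M-1."""
    return [(home + i * c) % M for i in range(1, M)]

# M = 13, c = 2: gcd(2, 13) = 1, so all 12 other slots are visited.
order_13 = visit_order(2, 2, 13)
# M = 10, c = 2: gcd(2, 10) = 2, so only the even slots are ever probed.
order_10 = visit_order(2, 2, 10)
```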

Quadratic Probing¶

  • slot = ( home + $i^2 $) % $M$

  • Quadratic probing eliminates primary clustering by increasing the distance between each probe in the sequence.

  • Introduces the problem of secondary clustering

    • when two keys map to the same table entry and have the same probe sequence.
    • for example, 388 and 258 (both hash to slot 11).
  • No guarantee the quadratic probe will visit every entry in the table.

  • If $M$ is prime, then at least half of them will be visited.
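Secondary clustering is easy to see: 388 and 258 both hash to slot 11 and therefore generate the identical quadratic probe sequence (a sketch; the helper name is illustrative):

```python
def quad_sequence(key, M, probes=6):
    """First few slots visited by (home + i^2) % M."""
    home = key % M
    return [(home + i * i) % M for i in range(probes)]

seq_388 = quad_sequence(388, 13)
seq_258 = quad_sequence(258, 13)
```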

Double Hashing¶

  • The quadratic probe distributes the keys by increasing steps in the probe sequence.

  • But the same sequence is followed by multiple keys that map to the same table entry (secondary clustering)

  • This occurs because the probe equation is based solely on the original hash slot.

  • In double hashing, when a collision occurs, the key is hashed by a second function and the result is used as the constant factor in the linear probe:

    • slot = ( home + $i * hp\ (key))\ \ $ % $M$

Double Hashing¶

  • To reduce clustering, the second hash function $hp(\cdot)$ should not be the same as the main hash function:

    • $hp(\cdot)\not= h(\cdot)$
  • It should produce a probe step in the range $0 < c < M$.

  • A simple choice for the second hash function takes the form:

    • $hp(key) = 1 + key\ \% P$
    • $P < M$

Double Hashing¶

  • In double hashing, when a collision occurs, the key is hashed by a second function and the result is used as the constant factor in the linear probe:

    • slot = [ home + $(i * (1 + key\%P)) ]\ \ \% M$
  • While the step size remains constant throughout the probe,

    • multiple keys that map to the same table entry will have different probe sequences.

M = 13, P = 8¶


  • 226%13 = 5

    • 226%8 = 2
    • 5 + (1 + 2) = 8
    • (slot = [ home + $(i * (1 + key\%P)) ]\ \ \% M$, with $i = 1$)
  • 388%13 = 11

    • 388%8 = 4
    • 11 + (1 + 4) = 3 (modulo 13)
2 collisions¶
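The two probes above follow directly from slot = (home + $i\cdot(1 + key\%P)$) % $M$, sketched here with $M = 13$ and $P = 8$ (the `probe` helper is illustrative):

```python
M, P = 13, 8

def probe(key, i):
    home = key % M
    step = 1 + key % P       # second hash: a constant step per key
    return (home + i * step) % M

# 226: home 5, step 3 -> first probe lands at slot 8
# 388: home 11, step 5 -> first probe wraps around to slot 3
```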

Double Hashing¶

  • Used to resolve collisions since it reduces both primary and secondary clustering.

  • To ensure every table entry is visited during the probing, the table size must be a prime number.

Rehashing¶

  • How to decide how big a hash table should be?

  • After some time, we may need to make room for more entries

  • We need a new hash table

  • Cannot simply copy -- we have to rebuild or rehash the entire table

(figure: the example table rehashed from M = 13 into a larger table with M = 17)

  • Experience has shown that hashing works best when the table is no more than $\approx \frac{3}{4}$ full.

  • If the hash table is to be expanded, it should be done before the table becomes full.

  • The ratio between the number of keys in the hash table and the size of the table is called the load factor.

  • In practice, a hash table should be expanded before the load factor reaches 80%.

  • Rule of Thumb: Double the table size $(2M + 1)$.
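Rehashing cannot copy slots over, because the slot of each key depends on $M$; every key must be re-inserted into the new table. A linear-probing sketch of the rule of thumb (helper names are illustrative):

```python
def insert(table, key):
    M = len(table)
    slot = key % M
    while table[slot] is not None:
        slot = (slot + 1) % M
    table[slot] = key

def rehash(table):
    new_table = [None] * (2 * len(table) + 1)   # rule of thumb: 2M + 1
    for key in table:
        if key is not None:
            insert(new_table, key)              # re-insert; do not copy slots
    return new_table

table = [None] * 13
for k in [765, 431, 96, 142, 579, 226, 903, 388]:
    insert(table, k)
table = rehash(table)   # M becomes 27; every key lands in a new slot
```

Note that $2 \cdot 13 + 1 = 27$ is not prime; given the earlier advice on table sizes, a real implementation would round the new size up to a prime.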

Average search times for linear and quadratic probes¶


For load factor $\geq\frac{2}{3}$, the average number of comparisons becomes very large, especially for an unsuccessful search.¶

Separate Chaining¶

  • We can eliminate collisions if we use linked lists.


  • The linked lists are commonly referred to as chains
    • This technique of collision resolution is known as separate chaining.

Separate Chaining¶

  • Hash table is constructed as an array of linked lists.

  • New keys can be prepended to the linked list since the nodes are in no particular order.

  • Deleting is straightforward: remove the node from its chain.
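A chained table is just an array of lists, and each operation touches only one chain (a sketch; the doc prepends new keys, while this sketch appends, since order within a chain does not matter):

```python
M = 13
table = [[] for _ in range(M)]     # one chain (list) per slot

def insert(key):
    table[key % M].append(key)     # no collisions: the chain just grows

def search(key):
    return key in table[key % M]   # scan only this slot's chain

def delete(key):
    table[key % M].remove(key)     # remove the node from its chain

for k in [765, 431, 96, 142, 579, 226, 903, 388]:
    insert(k)
```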

Separate Chaining¶

  • Also known as open hashing since the keys are stored outside the table (not in the array).

    • Confusingly, closed addressing is used to describe open hashing, and
    • open addressing is used to describe closed hashing.
  • The "addressing" term refers to the possible locations of the keys in relation to the table entries.

    • In open addressing, a key may be stored in an open slot different from the one to which it originally mapped,
    • while in closed addressing, the key is contained within the entry to which it mapped.

Hash Functions¶

  • A “perfect” hash function will map every key to a different table entry, resulting in no collisions.

    • Very hard to achieve, possible when:
      • keys are within a small range
      • keys are known beforehand
  • Goal: design a hash function that will distribute the keys across the range as evenly as possible.

Hash Functions - Desirable Properties¶

  • Easy to compute

  • The resulting index must not change.

    • When applied multiple times to the same key, it must always return the same index value
  • If the key consists of multiple parts, every part should contribute to the computation of the resulting index value.

  • The table size should be a prime number, especially with the '%' operator.

Hash Functions¶

  • Division (Remainder)

    • $h(key) = key\ \%\ M$
  • Truncation

    • For very large numbers, some digits can be ignored.
      • example: 4873152 (a 7-digit number) mapped into 0 - 999,
      • we can use the digits at positions 1, 3 and 6 (0-indexed), namely 8, 3 and 2, to get 832.

Hash Functions...¶

  • Folding
    • key is split into multiple parts.
    • combined into a single value by adding or multiplying the parts.
    • example: 4873152 ; split into 48, 731, and 52 ; add them up.
    • 48 + 731 + 52 = 831 (location)
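Both digit-based functions can be sketched directly; the positions and part sizes below match the worked examples and are otherwise arbitrary choices:

```python
def truncate_hash(key, positions=(1, 3, 6)):
    """Keep only the digits at the given (0-indexed) positions."""
    digits = str(key)
    return int("".join(digits[p] for p in positions))

def fold_hash(key, sizes=(2, 3, 2)):
    """Split the digit string into parts of the given sizes and add them."""
    digits, parts, i = str(key), [], 0
    for n in sizes:
        parts.append(int(digits[i:i + n]))
        i += n
    return sum(parts)
```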

String Hashing¶

  • Convert string representation to integer

  • Simplest approach is to sum the ASCII values of the individual characters.

    • works well with small hash tables.
    • pneumonoultramicroscopicsilicovolcanokoniosis
    • The ASCII values of this word sum to 4888.
  • Another approach is to use a polynomial:

    • $s_0a^{n-1} + s_1a^{n-2} + \ldots + s_{n-3}a^2 + s_{n-2}a + s_{n-1}$

      • $a$ is non-zero
      • $s_i$ is the ASCII value of the $i^{th}$ character in the string
      • $n$ is the length of the string
    • works well even if the strings are short
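Both string hashes can be sketched in a few lines; the polynomial is evaluated with Horner's rule, and $a = 31$ is a common but illustrative choice not fixed by the slides:

```python
def ascii_sum_hash(s):
    """Sum of the ASCII (code point) values of the characters."""
    return sum(ord(ch) for ch in s)

def poly_hash(s, a=31, M=13):
    """Polynomial hash s0*a^(n-1) + ... + s_{n-1}, via Horner's rule, mod M."""
    h = 0
    for ch in s:
        h = (h * a + ord(ch)) % M   # ((s0*a + s1)*a + s2)... mod M
    return h
```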